Abstract. In complex scale-free networks, ranking the individual nodes based upon their importance has 
useful applications, such as the identification of hubs for epidemic control, or bottlenecks for controlling 
traffic congestion. However, in most real situations, only limited sub-structures of entire networks are 
available, and therefore the reliability of the order relationships in sampled networks requires investigation. 
With a set of randomly sampled nodes from the underlying original networks, we rank individual nodes by 
three centrality measures: degree, betweenness, and closeness. The higher-ranking nodes from the sampled 
networks provide a relatively better characterisation of their ranks in the original networks than the lower- 
ranking nodes. A closeness-based order relationship is more reliable than any other quantity, due to the 
global nature of the closeness measure. In addition, we show that if access to hubs is limited during the 
sampling process, an increase in the sampling fraction can in fact decrease the sampling accuracy. Finally, 
an estimation method for assessing sampling accuracy is suggested. 

1 Introduction 

In recent years, there has been great interest in examining the properties of complex
networks such as the World Wide Web, the Internet, and social and biological networks [1].
Recent research on these networks reveals that many of them have scale-free structures
with a right-skewed degree distribution. This power-law degree distribution guarantees a
noticeable presence of nodes, or hubs, that have a very large number of connections
compared with the average node. The essential role of hubs in networks is widely
recognised in the contexts of immunisation in epidemic spreading [2], the formation of
social trends [3], finding drug targets on biological molecules [4,5], and optimal path
finding strategies [6]. For example, the study of the spread of viruses on the Internet
shows that targeting immunisation on hubs drastically reduces the occurrence of endemic
states, even with a very low immunised fraction, whereas uniform immunisation does not
lead to a drastic reduction in the infection prevalence [1,2]. For drug target
identification in biological systems, the likelihood that removal of a protein will be
lethal correlates strongly with the number of connections to that protein in the
protein-protein interaction network [5].

In these examples, accurate identification of the important nodes, i.e. hubs, is an
efficient way to resolve the specified problems. Such identification, however, requires
that the ranks of the individual nodes are known, based on their importance and
contribution to the entire network. In targeted immunisation on hubs, the more accurate a
policy is at identifying the ranks, the smaller the number of necessary cures [7]. In real
situations, however, only part of the information on the underlying networks can be
exploited, due to severe physical and economic constraints [6,8,9,10,11,12]. For example,
a survey of relationships among participants has to be conducted in order to construct a
social network, but the collected network data might be incomplete, since surveys usually
target only a limited sample of the whole population. Therefore, the statistical
properties of a network must frequently be assessed without complete knowledge of global
information on the entire network. Nevertheless, the sampling problem in complex networks
has not yet been extensively explored [10,12,13], despite the substantial interest in the
community of social network analysis [8].



Given that only partial information on a network can be obtained, it is worth
investigating how accurately the importance of a node, based on only partial information,
reflects the actual importance of the node in the original network. For successful
epidemic control, it is important to determine whether or not the hubs identified as
critical from the incomplete data remain so even after adding supplementary data [14].
The study of rank reliability in sampled networks can also be applied to many
technological and biological systems, and avoids possible artifacts that depend on a
specific numerical scale of the data, since stretching or compressing the scale does not
alter a rank-based result.

In the present work, we analyse the Barabási-Albert (BA) model as the prototype example
of a scale-free network [1]^1, which allows us to clearly discriminate the contribution of
the power-law degree distribution to the sampling effect from the contribution of
additional specific biases that appear in other networks. Furthermore, to consider
realistic effects that are disregarded in the BA model, we also analyse several real
networks, such as the Los Alamos e-Print Archive coauthorship network [15], the Internet
AS [16], and protein-protein interaction networks [5,17]. We concentrate only on cases
where the accessible information on the networks is limited to the connectivity between
randomly sampled nodes, although in reality there are other kinds of available
information, including the connectivity from snowball sampling and that from randomly
sampled links [13]. Snowball samples start from identified nodes, whose linked nodes are
all included and then used in turn to reach further nodes; this scheme is usually
employed by Web search engines. Randomly sampled links describe the randomly gathered
connectivity between nodes, e.g. in the case of poorly gathered contact information
between patients. Snowball sampling is expected to introduce few sampling biases, because
the topology is literally conserved during the sampling, while the possible nontrivial
results from randomly sampled links can be sufficiently analogous to those from randomly
sampled nodes, with some correspondence between them [13]. Thus, the focus on randomly
sampled nodes can be considered a reasonable step towards investigating network-sampling
problems, although the study of only randomly sampled nodes has limitations for
understanding more specific problems in real situations. In this regard, the possible
deviations between our results and reality could be further reduced by additional
investigation of different sampling schemes.
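
As an illustration of the random node sampling considered here, the following minimal
Python sketch keeps a random fraction of the nodes together with only the links among
them. The adjacency representation (a dict mapping each node to a set of neighbours) and
the names `induced_sample` and `rng` are assumptions made for this example, not the
implementation used for the results reported in this paper.

```python
import random


def induced_sample(adj, fraction, rng):
    """Return the subgraph induced by a random `fraction` of the nodes.

    adj: dict mapping node -> set of neighbouring nodes (undirected graph).
    Only links whose two end nodes are both sampled are kept.
    """
    nodes = list(adj)
    n_keep = round(fraction * len(nodes))
    kept = set(rng.sample(nodes, n_keep))
    return {u: adj[u] & kept for u in kept}


# Toy usage on a path 0-1-2-3 (hypothetical data, not one of the studied networks).
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
sub = induced_sample(path, 0.5, random.Random(0))
```

Sampled nodes whose neighbour set is empty in `sub` are the isolated nodes of the sample.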

2 Measured quantities 

In sampled or entire networks, individual nodes can be ranked according to their
importance or prestige [18], such as degree. With the set of sampled nodes, we first
define a measure for the rank correlation between the sampled nodes and the nodes in the
original network, denoted by τ, which is a variant of Kendall's Tau [19] and represents
how faithfully the rank order is preserved in the sampled network. For an arbitrary pair
of sampled nodes {i, j}, the nodes have assigned importances, such as degrees,
$\{k_i, k_j\}$ in the sampled network and $\{k_i^o, k_j^o\}$ in the original network. If
$k_i < k_j$ ($k_i > k_j$) and $k_i^o < k_j^o$ ($k_i^o > k_j^o$), or $k_i = k_j$ and
$k_i^o = k_j^o$, we consider the pair to be ordered similarly in the sampled and original
networks. Otherwise, it is regarded as ordered dissimilarly. To quantify the
preservability of rank order, the dominance of pairs ordered similarly in both the
sampled and original networks is normalised by the total number of pairs considered in
the calculation, through $\tau = (N_+ - N_-)/(N_+ + N_-)$, where $N_+$ is the number of
pairs ordered similarly for sampled and original networks, and $N_-$ is the number of
pairs that are ordered dissimilarly. τ can take a value from -1 to 1, indicating complete
disagreement and full agreement, respectively. Without any tied ranks, if the ranks are
more preserved in sampling than expected by random shuffling, τ is positive. For the
probability $p$ that an arbitrary pair is ordered similarly, we obtain the relationship
$p = (\tau + 1)/2$.

^1 Performing the analysis on the configuration model [23] instead of the BA model does
not alter the current results.
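
The definition of τ can be made concrete with a short Python sketch. It is a direct,
quadratic-time transcription of the pair-counting rule stated in this section, written
under our own assumptions about the data layout (two dicts mapping each sampled node to
its score); the function name `rank_tau` is hypothetical.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rank_tau(sampled, original):
    """tau = (N_plus - N_minus) / (N_plus + N_minus) over all pairs of sampled nodes.

    sampled, original: dicts mapping node -> score (e.g. degree) in the sampled
    and in the original network. A pair is ordered similarly when the order
    relation (<, > or =) of its two scores is the same in both networks.
    """
    n_plus = n_minus = 0
    for i, j in combinations(sampled, 2):
        if sign(sampled[i] - sampled[j]) == sign(original[i] - original[j]):
            n_plus += 1
        else:
            n_minus += 1
    total = n_plus + n_minus
    return (n_plus - n_minus) / total if total else 0.0


# Toy usage (hypothetical scores): one of the three pairs is ordered dissimilarly.
tau = rank_tau({"a": 1, "b": 2, "c": 3}, {"a": 1, "b": 3, "c": 2})
p = (tau + 1) / 2  # fraction of pairs ordered similarly
```

In this toy case two of the three pairs agree, so `p` equals 2/3 up to rounding.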

Because the statistical properties of many real networks follow a universal
characteristic like a power-law distribution, their preservability in sampled networks
has been of basic interest in previous studies [10,13]. These statistical properties,
however, are rarely affected by interchanging the prestiges of nodes. Hence, it is worth
comparing the preservability of these individual-prestige-insensitive properties in
sampled networks to that of individual-prestige-sensitive properties such as τ.
Therefore, we introduce another complementary measure, ρ, which represents the similarity
between two probability distributions of centrality, one from the sampled nodes and the
other from the original network, where the latter obeys a power law. First, we obtain the
cumulative distribution of a variable $k$: $P_s(k)$ from the sampled nodes, and $P_o(k)$
from the original network. Using $k_i$ of the $i$th sampled node, we find $k_i^o$
satisfying $P_s(k_i) = P_o(k_i^o)$, and calculate the Pearson correlation ρ between $k_i$
and $k_i^o$ for $i = 1, 2, \ldots, N$, where $N$ is the number of sampled nodes. ρ
achieves its maximum value 1 if $k_i$ is proportional to $k_i^o$. This means that when
$P_s(k) \propto k^{-\alpha}$ and $P_o(k) \propto k^{-\beta}$, ρ achieves its maximum
value 1 if $\alpha = \beta$, i.e. in the case of identical power-law distributions. By
applying proper normalisation, we transform the measure ρ so as to take a value from 0 to
1 in its significant range.^2 ρ gives the preservability of probability distributions
rather than that of the node rank; thus ρ can have a large value under similar
probability distributions, even if the ranks themselves are severely altered. In a
practical sense, it is possible to directly evaluate the exponent difference of the
power-law distribution between sampled and original networks, and the details of the
results exhibit some notable properties, including a slight overestimation of the
exponents during random sampling [13]. For the degree distribution of the BA model, a
sampling overestimation of the exponent by a factor of 1.2 corresponds to ρ = 0.8. It
should be noted that isolated nodes, which have no links to the other sampled nodes, were
excluded in the calculation of τ and ρ. These measures can easily be applied to other
quantities, such as betweenness centrality, as will be shown below.
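
The matching step behind ρ can be sketched in Python as follows. This is only an
illustration under stated assumptions: the cumulative distribution is taken as the
empirical fraction of values not exceeding $k$, the equality $P_s(k_i) = P_o(k_i^o)$ is
relaxed to the smallest original value whose cumulative fraction reaches $P_s(k_i)$, and
the final normalisation of ρ is left out. The function names are hypothetical.

```python
from bisect import bisect_right
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def matched_values(sampled_k, original_k):
    """For each sampled value, the original value at the same cumulative level."""
    s_sorted, o_sorted = sorted(sampled_k), sorted(original_k)
    n_s, n_o = len(s_sorted), len(o_sorted)
    matched = []
    for k in sampled_k:
        count = bisect_right(s_sorted, k)         # number of sampled values <= k
        idx = (count * n_o + n_s - 1) // n_s - 1  # ceil(count * n_o / n_s) - 1
        matched.append(o_sorted[idx])
    return matched


def raw_rho(sampled_k, original_k):
    """Unnormalised rho: Pearson correlation between k_i and the matched k_i^o."""
    return pearson(sampled_k, matched_values(sampled_k, original_k))


# Toy usage (hypothetical values): sampled values proportional to the original ones.
rho = raw_rho([1, 2, 3, 4], [2, 4, 6, 8])
```

Because the toy sampled values are proportional to the matched original values, `rho`
equals 1 up to rounding, in line with the property stated above.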



^2 To this end, we calculate the Pearson correlation $\rho_{th}$ between $k_i$ and
$k_i^o$ as if $P_s(k)$ is a simple linear function of $k$. We finally obtain the value of
ρ as $\rho \to (\rho - \rho_{th})/(1 - \rho_{th})$. Therefore, ρ becomes positive if the
probability distribution from the sampled nodes resembles that from the original network
more than a simple linear curve does.

3 Simulation and results 

In this paper, we rank-order individual nodes using three centrality measures of complex
networks (degree, betweenness, and closeness) in order to calculate τ [15,20].^3 We also
calculate ρ for degree and betweenness, based on their power-law statistics [1,21]. Using
randomly sampled nodes, Figure 1 displays the result for the BA model, which reflects
results typical of other real networks with regard to the qualitative distinction between
τ and ρ. In Figure 1, as the sampling fraction increases, τ grows gradually while ρ grows
quickly and saturates at 1.^4 It has been verified that the early saturation of ρ is due
to the overall proportional relation between the centrality measure obtained from the
randomly sampled nodes and that from the original networks [13]. On the other hand, the
continuous and rather slow growth of τ indicates the sensitivity of individual-level
prestige to the sampling, especially for the low-rank nodes, as will be presented below.
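
Of the three centrality measures, closeness is used here in the reciprocal-distance form
given in the footnote: the average of the reciprocal distances from a node to all other
nodes. A minimal breadth-first-search sketch in Python is given below. It assumes an
unweighted, undirected graph stored as a dict of neighbour sets, and it assumes that
unreachable nodes contribute zero to the average (the natural limit of a reciprocal
distance); the function name `closeness` is hypothetical.

```python
from collections import deque


def closeness(adj, source):
    """Average reciprocal shortest-path distance from `source` to all other nodes.

    adj: dict mapping node -> set of neighbouring nodes (undirected, unweighted).
    Nodes that cannot be reached from `source` add 0 to the sum (assumption).
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search yields shortest hop distances
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    n_others = len(adj) - 1
    if n_others <= 0:
        return 0.0
    return sum(1.0 / d for d in dist.values() if d > 0) / n_others


# Toy usage on a path 0-1-2 (hypothetical data).
path = {0: {1}, 1: {0, 2}, 2: {1}}
c_end, c_mid = closeness(path, 0), closeness(path, 1)
```

With this convention, the measure remains defined when the sampled nodes split into
several disconnected clusters.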

Interestingly, the contribution of an individual node to the value of τ is not uniform
over all nodes, and strongly depends on the rank of the node. To examine this property in
detail, we divide the sampled nodes into subgroups according to their individual ranks in
the sampled nodes. For example, in the case of degree-based ranks, each node belongs to
one of 10 groups (the highest 0-10%, 10-20%, ..., 90-100% ranks) in descending order of
degree. To obtain the contribution to τ made by each group, we calculate τ over pairs of
nodes {i, j} where the ith node is a member of the given group, and the jth node is a
member of any group. Figure 2a illustrates the result for the BA model; the groups of
higher-ranking nodes have large τ's, indicating that the higher-ranking nodes of the
sampled nodes provide a better characterisation of their ranks in the original networks
[11]. This point is expected to hold universally in scale-free networks, because the
nodes of large degree hardly face the shuffling of their ranks in sampling due to their
relatively small population. In the Erdős-Rényi model, the intermediate ranks comprise a
greater proportion of the population than either the high or low ranks. It is therefore
expected that τ reaches its minimum value at intermediate ranks, as observed in Figure 2b.
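
The group-wise decomposition of τ can be sketched in Python as follows. The sketch rests
on assumptions not fixed by the text: ties in the sampled score are assigned to groups in
arbitrary (sort) order, and a pair whose two nodes both belong to the given group is
counted once. The function name `group_tau` and the toy scores are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def group_tau(sampled, original, n_groups=10):
    """tau for each rank group of the sampled nodes, highest-ranking group first.

    sampled, original: dicts mapping node -> score (e.g. degree).
    For a given group, pairs {i, j} run over i in the group and j in any group.
    Groups that contain no pair yield None.
    """
    ranked = sorted(sampled, key=lambda v: sampled[v], reverse=True)
    n = len(ranked)
    taus = []
    for g in range(n_groups):
        start, stop = g * n // n_groups, (g + 1) * n // n_groups
        n_plus = n_minus = 0
        for a in range(start, stop):
            for b in range(n):
                # skip self-pairs and in-group pairs that were already counted
                if b == a or start <= b < a:
                    continue
                i, j = ranked[a], ranked[b]
                if sign(sampled[i] - sampled[j]) == sign(original[i] - original[j]):
                    n_plus += 1
                else:
                    n_minus += 1
        total = n_plus + n_minus
        taus.append((n_plus - n_minus) / total if total else None)
    return taus


# Toy usage (hypothetical scores): only the two lowest-ranking nodes are swapped.
taus = group_tau({"a": 4, "b": 3, "c": 2, "d": 1},
                 {"a": 40, "b": 30, "c": 1, "d": 2}, n_groups=2)
```

In this toy case the top group keeps all of its pair orderings while the bottom group
loses one of its five pairs.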

From observations in the BA model, we have found that the τ's for betweenness and
closeness have larger values than τ for degree, except in a few of the highest rank
groups. Even in these highest rank groups, τ for degree is comparable to the other τ's,
and does not dominate them. In an attempt to explain the smallness of τ for degree in
most groups, one might consider the discreteness effect of degree, e.g. the majority of
the nodes would possess a degree of 1 in a small sampling fraction. This severe
discreteness could hide the original ordinal information between the nodes, thus leading
to a smaller value of τ. Nevertheless, the discreteness effect does not sufficiently
explain our observation. To clarify this point, we calculate τ while excluding the pairs
of sampled nodes with tied prestiges, which reduces the discreteness effect. Figures 2c
and 2d show the results for sampling fractions of 40% and 60%, respectively, which
represent well the generic behaviour across all sampling fractions. Although τ for degree
becomes large in a small sampling fraction (see Fig. 2c), the feature seen in Figure 2a
eventually recovers as the sampling fraction increases (see Fig. 2d). This result implies
that the small value of τ for degree can be attributed to the intrinsic property of the
local centrality, by which individual prestige is sensitive to random sampling due to the
local fluctuation of the network topology.

^3 Closeness of the ith node is defined as the average of the reciprocal distances from
the ith node to all other nodes.

^4 Randomly sampled nodes are inevitably composed of several disconnected clusters.
Nonetheless, the main difference between τ and ρ is still valid even if we consider only
the largest component from these clusters.

Fig. 1. The horizontal axis represents the sampling fraction (the number of sampled nodes
divided by the number of total nodes), while the vertical axis represents the calculated
τ and ρ at each sampling fraction. We use the BA model with 30000 nodes and average
degree of 8.

For comparison with the BA model, we consider real networks, and observe some different
results. In real networks, τ for betweenness becomes suppressed and is no longer
comparable to τ for closeness (see Fig. 3a). Here, we present the case of the Los Alamos
e-Print Archive coauthorship network, although similar results are observed in other real
networks, including the Internet AS and protein-protein interaction networks.

The suppressed τ for betweenness reflects the sensitivity of the betweenness measure to
the network modularity. Unlike random networks including the BA model, many real networks
have structural sub-units, namely modular structures, that significantly affect the
centrality measures in ways not expected in random networks. For example, the presence of
nodes with small degree and large betweenness shown in Figure 3b indicates the existence
of loose connections between tightly-knit modules [22], such that the nodes on the loose
connections, which bear a considerable number of inter-modular communication paths,
exhibit large betweenness centrality despite their small degree. In this sense, violating
the modularity of the networks during random sampling can significantly alter the
betweenness-based node prestige, thereby lowering τ for betweenness. One way to confirm
this effect is to observe what happens if the modularity effect is reduced. To discard
the modularity effect, we sample only the nodes with highly correlated degree and
betweenness rather than sampling randomly, and calculate the corresponding τ. Under the
reduced-modularity effect, we can identify the range of the sampling fraction (in the
Archive coauthorship, ≲50%) in which τ for betweenness becomes comparable to τ for
closeness as in the BA model (see Figs. 3c and 3d). This shows that the modularity effect
is indeed essential to the suppression of τ for betweenness.

Fig. 2. The horizontal axis represents the groups according to their constituting ranks
in sampled nodes, while the vertical axis represents τ for each group. (a) 40% sampled
from the BA model with 30000 nodes and an average degree of 8; (b) the same condition as
(a) in the Erdős-Rényi model; (c) τ calculated without counting the pairs in tied
prestiges from (a); (d) increased sampling fraction of 60% from (c).

Consequently, the τ for each centrality measure relies on the sensitivity of the
centrality measure to the sampling. Indeed, the small τ for degree comes from the fact
that the degree ranks are shuffled by the local fluctuation of topology during the
sampling process. Although it is based upon global information on the networks,
betweenness concerns the number of shortest paths passing through a node itself; thus the
rank can be sensitive to the topological variation in the proximity of the node, and
especially to modular-level fluctuations. On the other hand, closeness is relatively
tolerant to such topological fluctuations, because it is built on robust global
information of the network, namely the averaged path lengths outward from a node.
Therefore, the closeness-based rank order possesses a larger τ than any other quantity,
due to the unique globality of closeness, which is insensitive to the sampling.

Fig. 3. (a) Rank vs. τ in the 30% sampled nodes from the Archive coauthorship. (b)
Individual degree vs. betweenness in the Archive coauthorship. (c) Individual degree vs.
betweenness in the highest 30% of highly correlated degree and betweenness. (d) Rank vs.
τ of the nodes from (c).

Because such a global characteristic of closeness is responsible for the large τ for
closeness, the value of τ for closeness can be suppressed if access to the hubs that bind
the network together globally is restricted in the sampling process. To simplify this
situation, we sample the nodes in ascending order of their centrality measures, rather
than randomly as presented before. Figures 4a-4c display the results gathered when nodes
are selected in ascending order of degree, and similar results are produced for the cases
of betweenness and closeness. As anticipated above, the value of τ for closeness is no
longer superior to that of any other quantity. Surprisingly, we further identify that in
real networks, τ attains its minimum value in an intermediate range of the sampling
fraction, and thus has a convex shape (see Fig. 4b). This directly indicates that with
small sampling fractions, if access to hubs is limited, an increase in the sampling
fraction (i.e., more nodes are sampled) can in fact decrease the sampling accuracy
(smaller τ) without a gain in valuable information. To avoid this type of error in the
analysis of social networks, a sufficient sampling size of social individuals must be
assured when access to the central leadership is restricted. Also, for the study of small
data sets in bioinformatics, the presence of hubs should be of concern because if they
are not available, the ordinal information extracted from the small data set is not
reliable. This exotic behaviour of real networks is essentially caused by the properties
of the degree distribution of real networks rather than by other structural properties
embedded in real networks, e.g. the modularity. Figure 4c exhibits the results for random
networks given the same degree distribution as that of the real networks [23], which
produce a feature similar to that shown in Figure 4b.
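
The hub-limited sampling scheme used in this paragraph, selecting nodes in ascending
order of degree, can be sketched in Python as follows. The graph representation (a dict
of neighbour sets) and the function name `ascending_degree_sample` are assumptions made
for this example; ties in degree are broken by sort order.

```python
def ascending_degree_sample(adj, fraction):
    """Keep the lowest-degree `fraction` of nodes and only the links among them.

    adj: dict mapping node -> set of neighbouring nodes (undirected graph).
    Degrees are taken from the original network, so hubs are sampled last.
    """
    by_degree = sorted(adj, key=lambda v: len(adj[v]))
    kept = set(by_degree[:round(fraction * len(by_degree))])
    return {u: adj[u] & kept for u in kept}


# Toy usage on a star (hypothetical data): node 0 is the hub, nodes 1-4 are leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
sub = ascending_degree_sample(star, 0.8)
```

In the toy star, the 80% sample contains the four leaves but not the hub, so no link
survives: without the hub that binds them together, the sampled nodes carry no
connectivity information at all.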

For predictive purposes, is it possible to estimate the value of τ for nodes sampled
randomly from entire networks? In real situations, since the information available to us
is that of sampled networks rather than that of entire networks, we can only evaluate the
τ of the nodes against a priori sampled networks, but not against the entire networks,
which are rarely accessible. Despite such limitations, we can exploit the τ of the nodes
sampled from these sampled networks to approximate that from the entire networks. In
random sampling, since decreasing the sampling fraction makes the network more
homogeneous, with small degrees [13], it is expected that the τ of a subset of randomly
sampled nodes underestimates that of a subset of the entire network with the same
sampling fraction. For the BA model, Figures 4d-4f establish the corresponding tendency,
manifested especially at low sampling fractions, which is consistently revealed in the
cases of other real networks, except only for the betweenness of the Internet AS.^5 In
this regard, we can use this underestimation to approximate the actual τ of an arbitrary
sampling fraction for the entire network by providing its lower bound. For example, in
the case of the protein-protein interaction network, τ for the degree of 30% sampled
nodes in our data is equal to 0.35, which means that the τ for 30% sampled nodes in a
complete data set would be greater than 0.35 if the sampling method is close to random
node sampling. Likewise, for 30% sampled nodes in the Archive coauthorship, τ for degree
is equal to 0.45, thereby indicating that τ would be larger than 0.45 for the 30% sampled
nodes in the complete data set.
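
Combined with the relationship between τ and the probability $p$ that an arbitrary pair
is ordered similarly, given in Section 2, these lower bounds on τ can be read as lower
bounds on $p$ for degree-ranked pairs in 30% samples of the complete data sets. The short
calculation below is only a worked illustration of that conversion, under the same
assumption that the sampling method is close to random node sampling.

```latex
\begin{align}
p &= \frac{\tau + 1}{2}, \\
\tau > 0.35 &\;\Rightarrow\; p > \frac{0.35 + 1}{2} = 0.675
  \quad \text{(protein-protein interaction network)}, \\
\tau > 0.45 &\;\Rightarrow\; p > \frac{0.45 + 1}{2} = 0.725
  \quad \text{(Archive coauthorship)}.
\end{align}
```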



^5 For the betweenness of the Internet AS, the tendency becomes reversed such that τ of
the subset in randomly sampled nodes overestimates that of the subset in the entire
network with the same sampling fraction.

Fig. 4. (a)-(c) Sampling under limited hub-accessibility. The horizontal axis represents
the sampling fraction in ascending order of degree, while the vertical axis represents τ
at each sampling fraction, for (a) the BA model with 30000 nodes and average degree of 8;
(b) the Archive coauthorship; and (c) a random network given the same degree distribution
as that of the Archive coauthorship. Results similar to those shown in (b) and (c) are
also found for other real networks. (d)-(f) Sampling from randomly sampled networks. The
horizontal axis represents the sampling fraction out of 100%-, 53%-, and 27%-sampled
nodes from the BA model with 30000 nodes and average degree of 8, and the vertical axis
represents τ at each sampling fraction: (d) τ for degree; (e) τ for betweenness; (f) τ
for closeness.



4 Conclusions 

In summary, we have investigated the accuracy of order relationships in sampled networks,
and found that the properties of complex networks, such as degree heterogeneity and
structural modularity, are responsible for the various results. The higher-ranking nodes
in sampled networks preserve their positions in the original networks more robustly than
the lower-ranking nodes, and the closeness-based order relationship gives the best
measure for faithful ordinal information in sampled networks. Interestingly, we
discovered that when access to hubs is limited during the sampling, the accuracy of the
sampling can in fact decrease as the sampling fraction increases. We emphasise the role
of hubs in characterising a sampled network, and the effect of the perturbed scale of the
network, to which each centrality measure responds sensitively. Beyond these analyses, a
methodology providing the lower bound for sampling accuracy is suggested. Our results can
be helpful for understanding the properties of sampled networks, especially for social
and criminal networks, for which analysis suffers from various types of sampling error
and other limitations [8,12,24]. The sampling problems in complex networks, including the
detection of errors in power-law statistics and the suggestion of useful sampling
protocols, are currently being explored [10,12,13].